The Option-Critic Architecture
Authors
Pierre-Luc Bacon, Jean Harb, Doina Precup
Abstract
Temporal abstraction is key to scaling up learning and planning in reinforcement learning. While planning with temporally extended actions is well understood, creating such abstractions autonomously from data has remained challenging. We tackle this problem in the framework of options [Sutton, Precup & Singh, 1999; Precup, 2000]. We derive policy gradient theorems for options and propose a new option-critic architecture capable of learning both the internal policies and the termination conditions of options, in tandem with the policy over options, and without the need to provide any additional rewards or subgoals. Experimental results in both discrete and continuous environments showcase the flexibility and efficiency of the framework.

Introduction

Temporal abstraction allows representing knowledge about courses of action that take place at different time scales. In reinforcement learning, options (Sutton, Precup, and Singh 1999; Precup 2000) provide a framework for defining such courses of action and for seamlessly learning and planning with them. Discovering temporal abstractions autonomously has been the subject of extensive research efforts in the last 15 years (McGovern and Barto 2001; Stolle and Precup 2002; Menache, Mannor, and Shimkin 2002; Şimşek and Barto 2009; Silver and Ciosek 2012), but approaches that can be used naturally with continuous state and/or action spaces have only recently started to become feasible (Konidaris et al. 2011; Niekum and Barto 2011; Mann, Mannor, and Precup; Mankowitz, Mann, and Mannor 2016; Kulkarni et al. 2016; Vezhnevets et al. 2016; Daniel et al. 2016).

The majority of the existing work has focused on finding subgoals (useful states that an agent should reach) and subsequently learning policies to achieve them. This idea has led to interesting methods, but ones which are also difficult to scale up given their "combinatorial" flavor. Additionally, learning policies associated with subgoals can be expensive in terms of data and computation time; in the worst case, it can be as expensive as solving the entire task.

We present an alternative view, which blurs the line between the problem of discovering options and that of learning them. Based on the policy gradient theorem (Sutton et al. 2000), we derive new results which enable a gradual learning process of the intra-option policies and termination functions, simultaneously with the policy over them. This approach works naturally with both linear and non-linear function approximators, under discrete or continuous state and action spaces. Existing methods for learning options are considerably slower when learning from a single task: much of the benefit comes from re-using the learned options in similar tasks. In contrast, we show that our approach is capable of successfully learning options within a single task without incurring any slowdown and while still providing re-use speedups.

We start by reviewing background related to the two main ingredients of our work: policy gradient methods and options. We then describe the core ideas of our approach: the intra-option policy and termination gradient theorems. Additional technical details are included in the appendix. We present experimental results showing that our approach learns meaningful temporally extended behaviors in an effective manner.
As opposed to other methods, we only need to specify the number of desired options; it is not necessary to have subgoals, extra rewards, demonstrations, multiple problems or any other special accommodations (however, the approach can work with pseudo-reward functions if desired). To our knowledge, this is the first end-to-end approach for learning options that scales to very large domains at comparable efficiency.

Preliminaries and Notation

A Markov Decision Process consists of a set of states $S$, a set of actions $A$, a transition function $P : S \times A \to (S \to [0, 1])$ and a reward function $r : S \times A \to \mathbb{R}$. For convenience, we develop our ideas assuming discrete state and action sets. However, our results extend to continuous spaces using usual measure-theoretic assumptions (some of our empirical results are in continuous tasks). A (Markovian stationary) policy is a probability distribution over actions conditioned on states, $\pi : S \times A \to [0, 1]$. In discounted problems, the value function of a policy $\pi$ is defined as the expected return $V_\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right]$ and its action-value function as $Q_\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$, where $\gamma \in [0, 1)$ is the discount factor. A policy $\pi$ is greedy with respect to a given action-value function $Q$ if $\pi(s, a) > 0$ iff $a = \operatorname{argmax}_{a'} Q(s, a')$. In a discrete MDP, there is at least one optimal policy which is greedy with respect to its own action-value function.

Policy gradient methods (Sutton et al. 2000; Konda and Tsitsiklis 2000) address the problem of finding a good policy by performing stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies, $\pi_\theta$. The policy gradient theorem (Sutton et al. 2000) provides expressions for the gradient of the average reward and discounted reward objectives with respect to $\theta$. In the discounted setting, the objective is defined with respect to a designated start state (or distribution) $s_0$: $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\right]$. The policy gradient theorem shows that
$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s, a),$$
where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0)$ is a discounted weighting of the states along the trajectories starting from $s_0$. In practice, the policy gradient is estimated from samples along the on-policy stationary distribution. Thomas (2014) showed that neglecting the discount factor in this stationary distribution makes the usual policy gradient estimator biased. However, correcting for this discrepancy also reduces data efficiency. For simplicity, we build on the framework of (Sutton et al. 2000) and discuss how to extend our results according to (Thomas 2014).

The options framework (Sutton, Precup, and Singh 1999; Precup 2000) formalizes the idea of temporally extended actions. A Markovian option $\omega \in \Omega$ is a triple $(I_\omega, \pi_\omega, \beta_\omega)$ in which $I_\omega \subseteq S$ is an initiation set, $\pi_\omega$ is an intra-option policy, and $\beta_\omega : S \to [0, 1]$ is a termination function. We also assume that $\forall s \in S, \forall \omega \in \Omega : s \in I_\omega$ (i.e., all options are available everywhere), an assumption made in the majority of option discovery algorithms. We will discuss how to dispense with this assumption in the final section. (Sutton, Precup, and Singh 1999; Precup 2000) show that an MDP endowed with a set of options becomes a Semi-Markov Decision Process (Puterman 1994, chapter 11), which has a corresponding optimal value function over options $V_\Omega(s)$ and option-value function $Q_\Omega(s, \omega)$.
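To ground these definitions, the following minimal sketch (ours, not from the paper) represents a Markovian option as the triple $(I_\omega, \pi_\omega, \beta_\omega)$ over a small discrete MDP. The class name `TabularOption`, the softmax intra-option policy and the sigmoid termination function are illustrative assumptions, chosen only to match the differentiable parameterizations assumed in the next section.

```python
import numpy as np

class TabularOption:
    """Illustrative Markovian option (I_omega, pi_omega, beta_omega) for a discrete MDP.

    Sketch under stated assumptions: the intra-option policy is a softmax over
    per-state action preferences and the termination function is a per-state
    sigmoid, so both are differentiable in their parameters.
    """

    def __init__(self, n_states, n_actions, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.theta = np.zeros((n_states, n_actions))   # intra-option policy parameters
        self.vartheta = np.zeros(n_states)             # termination parameters
        # Every state is in the initiation set, as assumed in the text above.
        self.initiation_set = np.ones(n_states, dtype=bool)

    def pi(self, s):
        """Intra-option policy pi_omega(. | s): softmax over action preferences."""
        prefs = self.theta[s] - self.theta[s].max()
        probs = np.exp(prefs)
        return probs / probs.sum()

    def beta(self, s):
        """Termination probability beta_omega(s): sigmoid of a per-state parameter."""
        return 1.0 / (1.0 + np.exp(-self.vartheta[s]))

    def act(self, s):
        """Sample a primitive action from the intra-option policy."""
        return self.rng.choice(len(self.theta[s]), p=self.pi(s))
```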
Learning and planning algorithms for MDPs have their counterparts in this setting. However, the existence of the underlying MDP offers the possibility of learning about many different options in parallel: the idea of intra-option learning, which we leverage in our work.

Learning Options

We adopt a continual perspective on the problem of learning options. At any time, we would like to distill all of the available experience into every component of our system: value function and policy over options, intra-option policies and termination functions. To achieve this goal, we focus on learning option policies and termination functions, assuming they are represented using differentiable parameterized function approximators.

We consider the call-and-return option execution model, in which an agent picks option $\omega$ according to its policy over options $\pi_\Omega$, then follows the intra-option policy $\pi_\omega$ until termination (as dictated by $\beta_\omega$), at which point this procedure is repeated. Let $\pi_{\omega,\theta}$ denote the intra-option policy of option $\omega$ parameterized by $\theta$ and $\beta_{\omega,\vartheta}$ the termination function of $\omega$ parameterized by $\vartheta$. We present two new results for learning options, obtained using the policy gradient theorem (Sutton et al. 2000) as a blueprint. Both results are derived under the assumption that the goal is to learn options that maximize the expected return in the current task. However, if one wanted to add extra information to the objective function, this could readily be done so long as it comes in the form of an additive differentiable function.

Suppose we aim to optimize directly the discounted return, expected over all the trajectories starting at a designated state $s_0$ and option $\omega_0$:
$$\rho(\Omega, \theta, \vartheta, s_0, \omega_0) = \mathbb{E}_{\Omega,\theta,\vartheta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0, \omega_0\right].$$
Note that this return depends on the policy over options, as well as the parameters of the option policies and termination functions. We will take gradients of this objective with respect to $\theta$ and $\vartheta$. In order to do this, we will manipulate equations similar to those used in intra-option learning (Sutton, Precup, and Singh 1999, section 8). Specifically, the definition of the option-value function can be written as:
$$Q_\Omega(s, \omega) = \mathbb{E}_{\Omega,\theta,\vartheta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, \omega_0 = \omega\right].$$
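As an illustration of the call-and-return execution model and of the return whose expectation defines $\rho(\Omega, \theta, \vartheta, s_0, \omega_0)$, here is a minimal sketch of a single rollout. It assumes an environment object whose `reset()` returns a state index and whose `step(a)` returns `(next_state, reward, done)`, a list of `TabularOption`-like objects as in the previous sketch, and a hypothetical `policy_over_options(s)` callable returning a probability vector over option indices; none of these names come from the paper.

```python
import numpy as np

def call_and_return_rollout(env, options, policy_over_options, omega0=None,
                            gamma=0.99, max_steps=1000, rng=None):
    """One episode under the call-and-return model (sketch, assumptions as stated above)."""
    rng = rng or np.random.default_rng(0)
    s = env.reset()
    # Start from a designated option omega0 if given, otherwise sample from pi_Omega(. | s).
    omega = omega0 if omega0 is not None else rng.choice(len(options), p=policy_over_options(s))
    ret, discount = 0.0, 1.0

    for _ in range(max_steps):
        a = options[omega].act(s)                  # follow the intra-option policy pi_omega
        s, r, done = env.step(a)
        ret += discount * r                        # accumulate gamma^t * r_{t+1}
        discount *= gamma
        if done:
            break
        if rng.random() < options[omega].beta(s):  # terminate with probability beta_omega(s)
            omega = rng.choice(len(options), p=policy_over_options(s))  # re-select an option
    return ret
```

With `omega0` held fixed, averaging the returned value over many rollouts started from the same state gives a Monte Carlo estimate of $Q_\Omega(s_0, \omega_0)$ as defined above.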
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2017